**Data Parallel C++ Essentials** 

# oneAPI VIRTUAL WORKSHOP

# Praveen Kundurthy

What is oneAPI and Data Parallel C++?



# Introduction to oneAPI

### Agenda

- a) Introduction & Overview to oneAPI
- b) Introduction to the Intel® DevCloud
- c) Introduction to Jupyter notebooks used for training
- d) Introduction to Data Parallel C++
- e) DPC++ Program Structure

### Hands On

- Introduction to DPC++ Simple
- Complex multiplication

# Learning Objectives

Explain how oneAPI can solve the challenges of programming in a heterogeneous world

Use oneAPI solutions to enable your workflows

Experiment with oneAPI tools and libraries on the Intel® DevCloud

Understand the Data Parallel C++ (DPC++) language and programming model

Use device selection to offload kernel workloads

Build a sample DPC++ application through hands-on lab exercises

#### Cross-Architecture Programming for Accelerated Compute, Freedom of Choice for Hardware

# oneAPI: Industry Initiative & Intel Products

One Intel Software & Architecture group Intel Architecture, Graphics & Software November 2020





### Programming Challenges for Multiple Architectures

Growth in specialized workloads

Variety of data-centric hardware required

Separate programming models and toolchains for each architecture are required today

Software development complexity limits freedom of architectural choice

**Application Workloads Need Diverse Hardware**

| Workload type | Programming model | Hardware (XPUs) |
|---------------|-------------------|-----------------|
| Scalar | CPU programming model | CPU |
| Vector | GPU programming model | GPU |
| Spatial | FPGA programming model | FPGA |
| Matrix | Other accel. programming models | Other accel. |

Middleware & Frameworks sit between the application workloads and the per-architecture programming models.

### Introducing oneAPI

Cross-architecture programming that delivers freedom to choose the best hardware

Based on industry standards and open specifications

Exposes cutting-edge performance features of latest hardware

Compatible with existing high-performance languages and programming models including C++, OpenMP, Fortran, and MPI



### oneAPI Industry Initiative: Break the Chains of Proprietary Lock-in

A cross-architecture language based on C++ and SYCL standards

Powerful libraries designed for acceleration of domain-specific functions

Low-level hardware abstraction layer

Open to promote community and industry collaboration

Enables code reuse across architectures and vendors



The productive, smart path to freedom for accelerated computing from the economic and technical burdens of proprietary programming models



#### Optimization Notice Copyright © 2019, Intel Corporation. All rights reserved. \*Other names and brands may be claimed as the property of others.


# Intel® oneAPI Toolkits

A complete set of proven developer tools expanded from CPU to XPU





### Toolkits powered by oneAPI

Data Scientists & AI Developers



#### Intel<sup>®</sup> AI Analytics Toolkit

Accelerate machine learning & data science pipelines with optimized DL frameworks & high-performing Python libraries

#### Intel<sup>®</sup> Distribution of OpenVINO<sup>™</sup> Toolkit

Deploy high performance inference & applications from edge to cloud


# Intel<sup>®</sup> oneAPI Base Toolkit Accelerate Data-centric Workloads

A core set of tools and libraries for developing high-performance applications on Intel® CPUs, GPUs, and FPGAs.

#### Who Uses It?

- A broad range of developers across industries
- Add-on toolkit users since this is the base for all toolkits

#### Top Features/Benefits

- Data Parallel C++ compiler, library and analysis tools
- DPC++ Compatibility tool helps migrate existing code written in CUDA
- Python distribution includes accelerated scikit-learn, NumPy, SciPy libraries
- Optimized performance libraries for threading, math, data analytics, deep learning, and video/image/signal processing

#### Intel® oneAPI Base Toolkit Components

| Direct Programming | API-Based Programming | Analysis & Debug Tools |
|--------------------|-----------------------|------------------------|
| Intel® oneAPI DPC++/C++ Compiler | Intel® oneAPI DPC++ Library (oneDPL) | Intel® VTune™ Profiler |
| Intel® DPC++ Compatibility Tool | Intel® oneAPI Math Kernel Library (oneMKL) | Intel® Advisor |
| Intel® Distribution for Python | Intel® oneAPI Data Analytics Library (oneDAL) | Intel® Distribution for GDB |
| Intel® FPGA Add-on for oneAPI Base Toolkit | Intel® oneAPI Threading Building Blocks (oneTBB) | |
| | Intel® oneAPI Video Processing Library (oneVPL) | |
| | Intel® oneAPI Collective Communications Library (oneCCL) | |
| | Intel® oneAPI Deep Neural Network Library (oneDNN) | |
| | Intel® Integrated Performance Primitives (Intel® IPP) | |

Learn More: intel.com/oneAPI-BaseKit

# Intel<sup>®</sup> oneAPI Data Parallel C++ Library (oneDPL)

- Three components:
  - 1. Standard C++ APIs: Tested and supported within DPC++ kernels
  - 2. Parallel STL: C++17 algorithms extended with DPC++ execution policies
  - 3. STL Extensions: Additional algorithms, classes and iterators

```
sycl::queue q;
std::vector<int> v(N);
std::sort(oneapi::dpl::execution::make_device_policy(q), v.begin(), v.end());
```

Recommended for codes using C++17 algorithms, or libraries like Thrust

See https://spec.oneapi.com/versions/latest/elements/oneDPL/source/index.html

### Intel<sup>®</sup> DPC++ Compatibility Tool Minimizes Code Migration Time

Assists developers in a one-time migration of code written in CUDA to DPC++, generating **human-readable** code wherever possible

~80-90% of code typically migrates automatically

Inline comments are provided to help developers finish porting the application

#### Intel DPC++ Compatibility Tool Usage Flow



### Intel<sup>®</sup> VTune<sup>™</sup> Profiler: DPC++ Profiling, Tune for CPU, GPU & FPGA

#### Analyze Data Parallel C++ (DPC++)

See the lines of DPC++ that consume the most time

#### Tune for Intel CPUs, GPUs & FPGAs

Optimize for any supported hardware accelerator

#### Optimize Offload

Tune OpenMP offload performance

#### Wide Range of Performance Profiles

CPU, GPU, FPGA, threading, memory, cache, storage...

#### Supports Popular Languages

DPC++, C, C++, Fortran, Python, Go, Java, or a mix

Source view: GPU Instructions Executed by Instruction Type (Control Flow, Send & Wait, Int32 & SP Float, Int64 & DP Float, Other)

| Line | Source | GPU Instructions Executed |
|------|--------|---------------------------|
| 158 | `dx = ptr[j].pos[0] - ptr[i].pos[0]` | 75,002,500 |
| 159 | `dy = ptr[j].pos[1] - ptr[i].pos[1]` | 12,500,000 |
| 160 | `dz = ptr[j].pos[2] - ptr[i].pos[2]` | 12,500,000 |
| 161 | | |
| 162 | `distanceSqr = dx*dx + dy*dy + dz*d` | 87,500,000 |
| 163 | `distanceInv = 1.0 / sqrt(distanceSo` | 12,500,000 |
| 164 | | |
| 165 | `ptr[i].acc[0] += dx * G * ptr[j].ma` | 162,503,750 |
| 166 | `ptr[i].acc[1] += dy * G * ptr[j].ma` | 150,000,000 |
| 167 | `ptr[i].acc[2] += dz * G * ptr[j].ma` | 150,000,000 |



There will still be a need to tune for each architecture.

# Intel<sup>®</sup> Advisor

**Design Assistant - Design for Modern Hardware** 

#### Offload Advisor

Estimate performance of offloading to an accelerator

#### **Roofline Analysis**

Optimize CPU/GPU code for memory and compute

#### Vectorization Advisor

Add and optimize vectorization

#### Threading Advisor

Add effective threading to unthreaded applications

#### Flow Graph Analyzer

Create and analyze efficient flow graphs






# SETUP INTEL® DEVCLOUD AND JUPYTER ENVIRONMENT

# Intel<sup>®</sup> DevCloud for oneAPI

- A development sandbox to develop, test and run workloads across a range of Intel CPUs, GPUs, and FPGAs using Intel<sup>®</sup> oneAPI beta software
- A fast way to start coding
- Try the oneAPI toolkits, compilers, performance libraries, and tools
- Get 120 days of free access to the latest Intel® hardware and oneAPI software
- No downloads; no hardware acquisition; no installation



#### A Fast Way to Start Coding

The Intel® DevCloud is a development sandbox to learn about and program oneAPI cross-architecture applications. Sign up for full access to the latest Intel® CPUs, GPUs, and FPGAs, Intel® oneAPI Toolkits, and the new programming language, Data Parallel C++ (DPC++).

Access is free for 120 days.



# Register for DevCloud

### Step 1: Register or Sign in to the Intel Developer Zone



To get an Intel® DevCloud account, you must first create a Basic Intel® Account



### Step 2: Activate Intel® DevCloud for oneAPI

To get free access, tell us a bit more about yourself and how you would like to use the Intel DevCloud.

| Required field (*) | Notes |
|--------------------|-------|
| First Name | |
| Last Name | |
| Email Address | Business email |
| Country/region | Select a country/region |
| Company or University | Company or academic institution |
| What type of developer are you? | Select from the list |
| Which hardware and accelerator architectures are you developing for? | Select all that apply: ASICs (application-specific integrated circuits), CPU, FPGA (field-programmable gate array), GPGPU (general-purpose GPU), GPU, Integrated Graphics |

Optional field: Do you have an event code provided by Intel?

# Get Started with DevCloud

### Step 3: Click the Get Started button

### Step 4: Scroll down to the bottom of the page and click Launch JupyterLab

#### Explore Intel oneAPI Toolkits in the DevCloud

These toolkits are for performance-driven applications—HPC, IoT, advanced rendering, deep lear toolkit to see what it includes, explore training modules, and go deeper with developer guides.



#### Connect with Jupyter\* Lab



#### Connect with Jupyter\* Notebook

Use Jupyter Notebook to learn about how oneAPI can solve the challenges of programming in a heterogeneous world and understand the Data Parallel C++ (DPC++) language and programming model.




### Setup Intel<sup>®</sup> DevCloud and Jupyter Environment

# Launch Jupyter and select Terminal




# Commands to Enter in the Terminal

Run the following command in the Jupyter terminal window:

`/data/oneapi_workshop/get_jupyter_notebooks.sh`

This command copies the workshop into your user directory.

# Select Welcome.ipynb

Contents of the oneAPI_Essentials folder in the JupyterLab file browser:

| Name | Last Modified |
|------|---------------|
| `00_Introduction_to_Ju` | 2 months ago |
| `01_oneAPI_Intro` | 2 months ago |
| `02_DPCPP_Program_St` | 2 months ago |
| `03_DPCPP_Unified_Sha` | 2 months ago |
| `04_DPCPP_Sub_Groups` | 2 months ago |
| `05_Intel_Advisor` | 2 months ago |
| `06_Intel_VTune_Profiler` | 2 months ago |
| `07_DPCPP_Library` | 2 months ago |
| `README.md` | 3 months ago |
| `Welcome.ipynb` | 2 months ago |

#### oneAPI Essentials Modules

The modules build on each other, introducing and reinforcing the concepts of Data Parallel C++.

#### Module 0 - Introduction to Jupyter Notebook (Optional)

This optional module explains how to use Jupyter Notebook, which is used in all of the modules to edit and run coding exercises. It can be skipped if you are already familiar with Jupyter Notebooks.

#### Module 1 - Introduction to oneAPI and DPC++

These initial hands-on exercises introduce you to DPC++ and the goal of oneAPI. In addition, they familiarize you with the use of Jupyter notebooks as a front-end for all training exercises. This workshop is designed to be used on the DevCloud and includes details on how to submit batch jobs in the DevCloud environment.

#### Module 2 - DPC++ Program Structure

These hands-on exercises present six basic DPC++ programs that illustrate the elements of a DPC++ application. You can modify the source code in some of the exercises to become more familiar with DPC++ programming concepts.


# **DPC++ Essentials** Course

Course material preview: a single-source DPC++ program in which host code and heterogeneous accelerator kernels are mixed in the same source file, with library constructs for work targeting, data management, and parallelism.

#### ND-Range Kernel Execution

Parallel execution with an ND-range kernel groups work-items so that they map onto hardware resources, which helps tune applications for performance.



#### Intel® Offload Advisor (Beta)

- Starts from a baseline binary (running on the CPU)
- Helps define which sections of the code should run on a given accelerator
- Provides performance projections on accelerators (currently gen9 and gen1)
- DPC++ kernels and their hardware metrics: GPU hardware metrics, GPU queue and utilization



DPC++ Essentials Course Curriculum provides 20 hours of training and exercises using Jupyter Notebooks integrated with Intel® DevCloud


# qsub

- qsub submits jobs to the DevCloud job queue
- Jobs run asynchronously and report status upon completion
- The traditional way to run qsub is to pass it a script: `qsub <script.sh>`
- qsub requires absolute paths, e.g. /bin/ls
- `qsub -w $PWD` runs the job in the current folder
- The output file is named `<scriptname>.o<jobid>`

# **qstat/qdel**

- qstat displays running jobs
- `qdel <jobid>` deletes pending jobs

Example session for user u42485 on s001-n003. Output of `qstat`:

| Job ID | Name | User | Time Use | S | Queue |
|--------|------|------|----------|---|-------|
| 591829.v-qsvr-1 | ub-singleuser | u42485 | 00:01:06 | R | jupyterhub |
| 591832.v-qsvr-1 | STDIN | u42485 | 0 | R | batch |
| 591833.v-qsvr-1 | STDIN | u42485 | 0 | R | batch |
| 591834.v-qsvr-1 | STDIN | u42485 | 0 | R | batch |
| 591835.v-qsvr-1 | STDIN | u42485 | 0 | R | batch |

The session then runs `qdel 591835` to delete job 591835.

# **Interactive shells**

- Getting an interactive shell
  - `qsub -I`
- Requesting an iGPU/FPGA node
  - `qsub -I -l nodes=1:gpu:ppn=2`
  - clinfo lists iGPU info

# Hands-on Coding on Intel DevCloud

### Run Simple DPC++ Program

# Data Parallel C++

Standards-based, cross-architecture language: DPC++ = ISO C++ and Khronos SYCL

# Parallelism, productivity and performance for CPUs and Accelerators

- Delivers accelerated computing by exposing hardware features
- Allows code reuse across hardware targets, while permitting custom tuning for specific accelerators
- Provides an open, cross-industry solution to single-architecture proprietary lock-in

#### Based on C++ and SYCL

- Delivers C++ productivity benefits, using common, familiar C and C++ constructs
- Incorporates SYCL from the Khronos Group to support data parallelism and heterogeneous programming

#### Community Project to drive language enhancements

- Provides extensions to simplify data parallel programming
- Continues evolution through open and cooperative development

### Apply your skills to the next innovation, not rewriting software for the next hardware platform



The open source and Intel DPC++/C++ compiler supports Intel CPUs, GPUs, and FPGAs. Codeplay announced a DPC++ compiler that targets Nvidia GPUs.

# What is Data Parallel C++?

- Data Parallel C++
  - = C++ and SYCL\* standard and extensions
- Based on modern C++
- C++ productivity benefits and familiar constructs

Standards-based, cross-architecture

 Incorporates the SYCL standard for data parallelism and heterogeneous programming

# DPC++ Extends SYCL\* standard

### **Enhance Productivity**

- Simple things should be simple to express
- Reduce verbosity and programmer burden

#### Enhance Performance

- Give programmers control over program execution
- Enable hardware-specific features

### DPC++: Fast-moving open collaboration feeding into the SYCL\* standard

- Open source implementation with the goal of upstreaming to LLVM
- DPC++ extensions aim to become core SYCL\*, or Khronos\* extensions

# A Complete DPC++ Program

#### Single source

 Host code and heterogeneous accelerator kernels can be mixed in same source files

#### Familiar C++

 Library constructs add functionality, such as:

| Construct     | Purpose         |
|---------------|-----------------|
| queue         | Work targeting  |
| malloc_shared | Data management |
| parallel_for  | Parallelism     |

```
#include <CL/sycl.hpp>
#include <iostream>
constexpr int N = 16;
using namespace sycl;

int main() {
  queue q;                                // work targeting
  int *data = malloc_shared<int>(N, q);   // data management (USM)
  // Accelerator device code
  q.parallel_for(N, [=](auto i) {
    data[i] = i;
  }).wait();
  for (int i = 0; i < N; i++) std::cout << data[i] << "\n";
  free(data, q);
  return 0;
}
```


# DPC++ Program Structure

### Agenda

- Deciding where code is run
- Data transfers and synchronization
- DPC++ execution model and memory model
- Hands On
  - Complex Multiplication

# Buffer Memory Model

Buffers encapsulate data shared between host and device.

Accessors provide access to data stored in buffers and create data dependences in the graph.

Unified Shared Memory (USM) provides an alternative pointer-based mechanism for managing memory.

```
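queue q;
std::vector<int> v(N, 10);
{
  // The buffer manages v's data for the duration of this scope.
  buffer buf(v);
  q.submit([&](handler& h) {
    // Requesting write access adds a dependence on buf to the task graph.
    accessor a(buf, h, write_only);
    h.parallel_for(N, [=](auto i) { a[i] = i; });
  });
}  // buf is destroyed here: the runtime waits for the kernel and writes results back to v.
for (int i = 0; i < N; i++) std::cout << v[i] << " ";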
```
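One way to picture why the example above wraps the buffer in its own scope is the host-only sketch below. It is a simplified model in standard C++, not SYCL: `BufferSketch` and `fill_with_indices` are hypothetical names invented for illustration, and the "device copy" is just a second vector. A real runtime decides for itself when and whether data moves; the sketch only shows the visible effect that results reach the host container when the buffer goes out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for sycl::buffer (NOT the SYCL class): it keeps a
// private "device" copy and writes it back to the host vector on destruction.
class BufferSketch {
 public:
  explicit BufferSketch(std::vector<int>& host)
      : host_(host), device_copy_(host) {}
  BufferSketch(const BufferSketch&) = delete;
  BufferSketch& operator=(const BufferSketch&) = delete;
  ~BufferSketch() { host_ = device_copy_; }  // write-back when the scope ends

  // Plays the role of a write accessor: it touches only the device copy.
  int& operator[](std::size_t i) {
    assert(i < device_copy_.size());
    return device_copy_[i];
  }
  std::size_t size() const { return device_copy_.size(); }

 private:
  std::vector<int>& host_;
  std::vector<int> device_copy_;
};

// Mirrors the example above: fill element i with the value i inside a scope.
std::vector<int> fill_with_indices(std::size_t n) {
  std::vector<int> v(n, 10);
  {
    BufferSketch buf(v);
    for (std::size_t i = 0; i < buf.size(); ++i) buf[i] = static_cast<int>(i);
    // Inside the scope v still holds its initial value 10: only the copy changed.
    assert(n == 0 || v[0] == 10);
  }  // buf destroyed: results are copied back into v
  return v;
}
```

Calling `fill_with_indices(4)` returns a vector holding 0, 1, 2, 3. Removing the inner braces would return the vector before the write-back happens, which is the mistake the scoped form avoids.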

# Important Classes in DPC++

| Class                                                                      | Functionality                                                                                                               |
|----------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
| sycl::device                                                               | Represents a specific CPU, GPU, FPGA or other device that can execute SYCL kernels.                                         |
| sycl::queue                                                                | Represents a queue to which kernels can be<br>submitted (enqueued).<br>Multiple queues may map to the same<br>sycl::device. |
| sycl::buffer                                                               | Encapsulates an allocation that the runtime can transfer between host and device.                                           |
| sycl::handler                                                              | Used to define a command-group scope that connects buffers to kernels.                                                      |
| sycl::accessor                                                             | Used to define the access requirements of specific kernels (e.g. read, write, read-write).                                  |
| sycl::range, sycl::nd_range, sycl::id, sycl::item, sycl::nd_item | Representations of execution ranges and individual execution agents in the range. |

## **Accessor Modes**

| Access Mode | Description                                        |
|-------------|----------------------------------------------------|
| read_only   | Read-only access                                   |
| write_only  | Write-only access; previous contents are not discarded |
| read_write  | Read and write access                              |

# DPC++ Code Anatomy




# Submitting to a Device

- A **device** represents a specific accelerator in the system.
- Work is not submitted to devices directly, but to a queue associated with the device.
- Creating a queue for a specific device requires a **device\_selector**.

```
default_selector selector;
// host_selector selector;
// cpu_selector selector;
// gpu_selector selector;
queue q(selector);
std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;
```

#### Parallel Kernels

- A parallel kernel allows multiple instances of an operation to execute in parallel.
- It is useful for offloading a basic for-loop in which every iteration is completely independent and the iterations can run in any order.
- Parallel kernels are expressed using the parallel\_for function.

for-loop in CPU application

Offload to accelerator using parallel\_for
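To make the order-independence requirement concrete, here is a minimal sketch in plain standard C++ that needs no SYCL runtime. The helper `run_in_order` and the function `double_all` are hypothetical names introduced for illustration only. `run_in_order` is a sequential stand-in for `parallel_for`: it calls the kernel once per index, in whatever order it is given.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for parallel_for: invoke `kernel` once per index,
// in the order listed in `order`. A real device may use any order, or run
// many instances at the same time.
template <typename Kernel>
void run_in_order(const std::vector<std::size_t>& order, Kernel kernel) {
    for (std::size_t i : order) kernel(i);
}

// Doubles every element. Each kernel instance touches only data[i], so the
// iterations are independent of one another.
std::vector<int> double_all(std::vector<int> data, bool reversed) {
    std::vector<std::size_t> order(data.size());
    std::iota(order.begin(), order.end(), std::size_t{0});  // 0, 1, 2, ...
    if (reversed) std::reverse(order.begin(), order.end());
    run_in_order(order, [&](std::size_t i) { data[i] *= 2; });
    return data;
}
```

Because no iteration reads what another iteration writes, `double_all` returns the same vector for either order. A loop with that property is a candidate for offloading with parallel\_for.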

#### Basic Parallel Kernels

The functionality of basic parallel kernels is exposed via the range, id and item classes.

- The range class describes the iteration space of a parallel execution.
- The id class indexes an individual instance of a kernel in a parallel execution.
- The item class represents an individual instance of a kernel function and exposes additional functions to query properties of the execution range.

```
h.parallel_for(range<1>(1024), [=](id<1> idx){
    // CODE THAT RUNS ON DEVICE
});
```

```
h.parallel_for(range<1>(1024), [=](item<1> item){
    auto idx = item.get_id();
    auto R = item.get_range();
    // CODE THAT RUNS ON DEVICE
});
```

#### ND-Range Kernels

Basic parallel kernels are an easy way to parallelize a for-loop, but they do not allow performance optimization at the hardware level.

An ND-range kernel is another way to express parallelism. It enables low-level performance tuning by providing access to local memory and by mapping executions to compute units on the hardware.

- The entire iteration space is divided into smaller groups called work-groups. The work-items within a work-group are scheduled on a single compute unit on the hardware.
- Grouping kernel executions into work-groups allows control over resource usage and over how work is balanced across compute units.



#### ND-Range Kernels

The functionality of nd\_range kernels is exposed via nd\_range and nd\_item classes



- The nd\_range class represents a grouped execution range using the global execution range and the local execution range of each work-group.
- The nd\_item class represents an individual instance of a kernel function and allows querying the work-group range and index.
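The relationship between the global range, the work-group (local) range and an individual work-item can be sketched with plain integer arithmetic. The code below is standard C++ with no SYCL dependency. `WorkItemPosition`, `group_count` and `locate` are hypothetical names used only for this illustration, and only the one-dimensional case is shown. In a real kernel the corresponding values come from nd\_item queries such as `get_global_id(0)`, `get_group(0)` and `get_local_id(0)`.

```cpp
#include <cassert>
#include <cstddef>

// Where one work-item sits inside a 1-D grouped execution range.
struct WorkItemPosition {
    std::size_t group_id;  // which work-group the item belongs to
    std::size_t local_id;  // index of the item inside its work-group
};

// Number of work-groups for a given global and local (work-group) size.
// The global size is expected to be a multiple of the local size.
std::size_t group_count(std::size_t global_size, std::size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

// Split a global index into (group, local) so that
// global_id == group_id * local_size + local_id.
WorkItemPosition locate(std::size_t global_id, std::size_t local_size) {
    assert(local_size > 0);
    return {global_id / local_size, global_id % local_size};
}
```

For example, a global range of 1024 with a work-group size of 64 gives 16 work-groups, and global index 130 is work-item 2 of work-group 2.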

#### Asynchronous Execution




#### Synchronization – Host Accessors

```
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;
constexpr int N = 16;

int main() {
   std::vector<double> v(N, 10);
   queue q;

   buffer buf(v);
   q.submit([&](handler& h) {
      accessor a(buf, h);
      h.parallel_for(N, [=](auto i) {
         a[i] -= 2;
      });
   });

   // Blocks until the kernel has finished and the data is visible on the host.
   host_accessor b(buf, read_only);
   for (int i = 0; i < N; i++)
      std::cout << b[i] << "\n";
   return 0;
}
```

The buffer takes ownership of the data stored in the vector.

Creating a host accessor is a blocking call. It returns only after all enqueued kernels that modify the same buffer, in any queue, have completed execution and the data is available to the host through this host accessor.

#### Synchronization – Buffer Destruction

```
#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;
constexpr int N = 16;

void dpcpp_code(std::vector<double> &v, queue &q) {
   buffer buf(v);
   q.submit([&](handler& h) {
      accessor a(buf, h);
      h.parallel_for(N, [=](auto i) {
         a[i] -= 2;
      });
   });
}  // buf is destroyed here: the result is copied back to v

int main() {
   std::vector<double> v(N, 10);
   queue q;
   dpcpp_code(v, q);
   for (int i = 0; i < N; i++)
      std::cout << v[i] << "\n";
   return 0;
}
```

Buffer creation happens within a separate function scope.

When execution advances beyond this function scope, the buffer destructor is invoked. The destructor waits for outstanding work on the buffer to complete, relinquishes ownership of the data, and copies the data back to host memory.
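The scope-based write-back can be illustrated with a small analogy in plain standard C++. `ToyBuffer` and `subtract_two` are hypothetical names, not SYCL classes or functions. The sketch only imitates the one behavior discussed here: work happens on a private copy, and the host container is updated when the object is destroyed.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for a buffer (hypothetical, not a SYCL class): it works on a
// private copy of the host data and writes that copy back in its destructor.
class ToyBuffer {
public:
    explicit ToyBuffer(std::vector<double>& host)
        : host_(host), device_copy_(host) {}
    ~ToyBuffer() { host_ = device_copy_; }  // copy-back on destruction
    ToyBuffer(const ToyBuffer&) = delete;
    ToyBuffer& operator=(const ToyBuffer&) = delete;

    double& operator[](std::size_t i) { return device_copy_[i]; }
    std::size_t size() const { return device_copy_.size(); }

private:
    std::vector<double>& host_;        // host container that receives the result
    std::vector<double> device_copy_;  // stands in for device-side storage
};

// Mirrors dpcpp_code above: the host vector changes only when buf is destroyed.
void subtract_two(std::vector<double>& v) {
    ToyBuffer buf(v);
    for (std::size_t i = 0; i < buf.size(); ++i) buf[i] -= 2;
    // v is still unchanged here; the write-back happens at the closing brace.
}
```

A real buffer does more than this, such as tracking dependencies between kernels and waiting for them to finish, but the lifetime rule is the same: the host sees the result once the buffer goes out of scope.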

#### **Custom Device Selector**

The following code shows a class derived from **device\_selector** that implements a device selection heuristic. The selector prioritizes GPU devices, and Intel GPUs in particular, because the integer rating it returns for them is higher than the rating for a CPU or any other accelerator.

```
#include <CL/sycl.hpp>
#include <iostream>
#include <string>
using namespace cl::sycl;

class my_device_selector : public device_selector {
public:
  // Higher rating wins: Intel GPU > other GPU > CPU > everything else.
  int operator()(const device& dev) const override {
    int rating = 0;
    if (dev.is_gpu() && (dev.get_info<info::device::name>().find("Intel") != std::string::npos))
      rating = 3;
    else if (dev.is_gpu()) rating = 2;
    else if (dev.is_cpu()) rating = 1;
    return rating;
  }
};

int main() {
  my_device_selector selector;
  queue q(selector);
  std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;
  return 0;
}
```
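How a rating turns into a choice can be sketched without a SYCL runtime. The code below is plain standard C++. `ToyDevice`, `rate` and `pick` are hypothetical names for illustration. The sketch assumes the selector is evaluated for every available device and the highest-rated one is used. Here a tie goes to the first device in the list, which is an arbitrary choice made for this example.

```cpp
#include <string>
#include <vector>

// Hypothetical description of a device, standing in for sycl::device.
struct ToyDevice {
    std::string name;
    bool is_gpu;
    bool is_cpu;
};

// Same heuristic as my_device_selector above.
int rate(const ToyDevice& dev) {
    if (dev.is_gpu && dev.name.find("Intel") != std::string::npos) return 3;
    if (dev.is_gpu) return 2;
    if (dev.is_cpu) return 1;
    return 0;
}

// Return the index of the highest-rated device, or -1 if the list is empty.
int pick(const std::vector<ToyDevice>& devices) {
    int best = -1;
    int best_rating = -1;
    for (int i = 0; i < static_cast<int>(devices.size()); ++i) {
        int r = rate(devices[i]);
        if (r > best_rating) {
            best_rating = r;
            best = i;
        }
    }
    return best;
}
```

With a CPU, a generic GPU and an Intel GPU in the list, `pick` returns the index of the Intel GPU. If the Intel GPU is removed, the generic GPU wins over the CPU.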

#### Hands-On: Complex Number Multiplication

- This lab provides source code that multiplies two complex numbers. The Complex class is a custom type that represents complex numbers.
- In this example the student will learn how to create a custom device selector and target a GPU or CPU from a specific vendor. The student will also learn how to pass a vector of custom Complex class objects to a parallel kernel, and will need to modify the source code to set up a write accessor and call the Complex class member function in the kernel to compute the multiplication.
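As a reminder of the arithmetic the kernel has to perform, here is a minimal sketch of a complex-number type in plain standard C++. The lab's actual Complex class is not reproduced in these slides, so the member names `multiply`, `real` and `imag` are assumptions made for this illustration.

```cpp
// Hypothetical stand-in for the lab's Complex class; member names are assumed.
class Complex {
public:
    Complex(int real, int imag) : real_(real), imag_(imag) {}

    // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    Complex multiply(const Complex& other) const {
        return Complex(real_ * other.real_ - imag_ * other.imag_,
                       real_ * other.imag_ + imag_ * other.real_);
    }

    int real() const { return real_; }
    int imag() const { return imag_; }

private:
    int real_;
    int imag_;
};
```

For example, `Complex(2, 3).multiply(Complex(4, 5))` yields -7 + 22i. In the lab, each kernel instance would apply a member function like this to one pair of elements and store the result through the write accessor.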

### Hands-on Coding on Intel DevCloud

Complex Multiplication with DPC++



#### Recap

- oneAPI solves the challenges of programming in a heterogeneous world
- Take advantage of oneAPI solutions to enable your workflows
- Use the Intel<sup>®</sup> DevCloud to test-drive oneAPI tools and libraries
- Introduced to DPC++ language and programming model
- Important Classes for DPC++ application
- Device selection and offloading kernel workloads
- DPC++ Buffers, Accessors, Command Group handler, lambda code as kernel



- This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
- The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request. No product or component can be absolutely secure. Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
- Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit <u>www.intel.com/benchmarks</u>.
- INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
- Copyright ©, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and OpenVINO are trademarks of Intel Corporation or its subsidiaries in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804
